1 Pig Functions

2 Pig Built-in Functions

3 Some Examples of Pig Built-in Functions

3.1 AVG() Function

  • The dataset: students' GPA data for different terms.

    John Smith, fall 2000, 3.9
    John Smith, winter 2000, 3.8
    John Smith, spring 2001, 4.0
    John Smith, summer 2001, 3.9
    Mary Clark, fall 2000, 3.5
    Mary Clark, winter 2000, 3.0
    Mary Clark, spring 2001, 3.8
    Mary Clark, summer 2001, 3.9
  • Calculate the average GPA of each student.
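
In Pig, this is typically done by grouping the records by student and applying AVG() to the grouped GPA values. As a rough illustration of what that computes, the Python sketch below (not Pig Latin; the variable names are illustrative only) performs the same group-and-average on the data above.

```python
from collections import defaultdict

# The sample records from above: (name, term, gpa).
records = [
    ("John Smith", "fall 2000", 3.9),
    ("John Smith", "winter 2000", 3.8),
    ("John Smith", "spring 2001", 4.0),
    ("John Smith", "summer 2001", 3.9),
    ("Mary Clark", "fall 2000", 3.5),
    ("Mary Clark", "winter 2000", 3.0),
    ("Mary Clark", "spring 2001", 3.8),
    ("Mary Clark", "summer 2001", 3.9),
]

# Collect the GPA values per student (the grouping step).
gpas_by_student = defaultdict(list)
for name, _term, gpa in records:
    gpas_by_student[name].append(gpa)

# Average each group (the AVG() step).
avg_gpa = {name: sum(g) / len(g) for name, g in gpas_by_student.items()}

for name, avg in avg_gpa.items():
    print(name, round(avg, 2))
```

The averages come out to about 3.9 for John Smith and 3.55 for Mary Clark.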

3.2 ROUND_TO() Function

  • Returns the value of an expression rounded to a fixed number of decimal digits. Note that this function is different from the ROUND() function, which returns the value of an expression rounded to an integer.
  • The syntax:

    ROUND_TO(val, digits [, mode])
  • val: an expression whose result is type float or double: the value to round.
  • digits: an expression whose result is type int: the number of digits to preserve.
  • mode: an optional int specifying the rounding mode, according to the constants Java provides.
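
To make the difference between rounding to a fixed number of digits and rounding to an integer concrete, here is a small Python sketch. The helper round_to is a hypothetical stand-in, not Pig's implementation, and its half-even default is an assumption; check the Pig documentation for the actual default mode.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_to(val, digits, rounding=ROUND_HALF_EVEN):
    """Round val to `digits` decimal places (illustrative stand-in only)."""
    # Decimal(1).scaleb(-digits) is 1, 0.1, 0.01, ... for digits = 0, 1, 2, ...
    quantum = Decimal(1).scaleb(-digits)
    # Going through str() avoids binary floating-point noise in the input.
    return float(Decimal(str(val)).quantize(quantum, rounding=rounding))


print(round_to(3.14159, 2))  # keeps two decimal digits
print(round_to(3.14159, 0))  # keeps no decimal digits
```

With digits set to 2 the value keeps two decimal places; with digits set to 0 the result has no fractional part, which is closest to what ROUND() returns.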

3.3 MAX() and MIN() Functions

  • Computes the maximum or minimum of the numeric values or chararrays in a single-column bag. MAX() or MIN() requires a preceding GROUP ALL statement for global maximums or minimums and a GROUP BY statement for group maximums or minimums.
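
The difference between a global and a per-group maximum or minimum can be sketched in Python, reusing the GPA records from Section 3.1 with the term column dropped. This is only an analogy for the two grouping styles, not Pig Latin.

```python
from collections import defaultdict

# (name, gpa) pairs taken from the Section 3.1 dataset.
records = [
    ("John Smith", 3.9), ("John Smith", 3.8),
    ("John Smith", 4.0), ("John Smith", 3.9),
    ("Mary Clark", 3.5), ("Mary Clark", 3.0),
    ("Mary Clark", 3.8), ("Mary Clark", 3.9),
]

# Global maximum and minimum: everything in one group (like GROUP ALL).
all_gpas = [gpa for _name, gpa in records]
global_max, global_min = max(all_gpas), min(all_gpas)

# Per-student maximum and minimum: one group per name (like grouping by name).
groups = defaultdict(list)
for name, gpa in records:
    groups[name].append(gpa)
per_student = {name: (max(g), min(g)) for name, g in groups.items()}

print(global_max, global_min)
print(per_student)
```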

3.4 SUBTRACT() Function

  • SUBTRACT() takes two bags as arguments and returns a new bag composed of the tuples of the first bag that are not in the second bag.
  • The implementation assumes that both bags being passed to the SUBTRACT() function will fit entirely into memory simultaneously; if this is not the case, it will still function but will be very slow.
  • Find out the bag elements that are in the first bag but not in the second bag:

    ({(8,9),(0,1),(1,2)},{(8,9),(1,1)})
    ({(2,3),(4,5)},{(2,3),(4,5)})
    ({(3,7),(3,7)},{(2,2),(3,7)})
    ({(1,2),(3,4),(5,6),(7,8)},{(2,3),(1,2)})
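
The expected results for these four rows can be worked out with a short Python sketch. The function subtract below is a hypothetical stand-in that follows the description above; Pig bags are unordered, so the lists are used only to keep the illustration readable.

```python
def subtract(bag1, bag2):
    """Return the tuples of bag1 that do not occur in bag2."""
    exclude = set(bag2)  # the second bag is held in memory, as noted above
    return [t for t in bag1 if t not in exclude]


# The four input rows from above, each a pair of bags.
rows = [
    ([(8, 9), (0, 1), (1, 2)], [(8, 9), (1, 1)]),
    ([(2, 3), (4, 5)], [(2, 3), (4, 5)]),
    ([(3, 7), (3, 7)], [(2, 2), (3, 7)]),
    ([(1, 2), (3, 4), (5, 6), (7, 8)], [(2, 3), (1, 2)]),
]

results = [subtract(b1, b2) for b1, b2 in rows]
for r in results:
    print(r)
```

The second and third rows produce empty results because every tuple of the first bag also appears in the second.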

3.5 ENDSWITH() and STARTSWITH() Functions

  • Tests inputs to determine if the first argument ends or starts with the string in the second argument. Returns true or false.
  • Syntax:

    ENDSWITH(string, testAgainst)
    STARTSWITH(string, testAgainst)
  • Examples:

    ENDSWITH ('foobar', 'foo') --> false
    ENDSWITH ('foobar', 'bar') --> true
    STARTSWITH ('foobar', 'foo') --> true
    STARTSWITH ('foobar', 'bar') --> false
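
For readers who know Python, the same four checks can be reproduced with the str methods endswith and startswith. This is a comparison with Python, not Pig Latin.

```python
# Python analogues of the four examples above.
checks = {
    "ENDSWITH('foobar', 'foo')": "foobar".endswith("foo"),
    "ENDSWITH('foobar', 'bar')": "foobar".endswith("bar"),
    "STARTSWITH('foobar', 'foo')": "foobar".startswith("foo"),
    "STARTSWITH('foobar', 'bar')": "foobar".startswith("bar"),
}

for call, result in checks.items():
    print(call, "-->", result)
```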

3.6 LTRIM(), RTRIM() and TRIM() Functions

  • LTRIM(): Returns a copy of a string with only leading whitespace removed.
  • RTRIM(): Returns a copy of a string with only trailing whitespace removed.
  • TRIM(): Returns a copy of a string with both leading and trailing whitespace removed.
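
The three behaviors can be compared side by side using Python's closest equivalents (lstrip, rstrip and strip). The sample string is made up for illustration, and Python's definition of whitespace may differ from Pig's in edge cases.

```python
s = "  pig latin  "  # two leading and two trailing spaces

ltrim = s.lstrip()  # like LTRIM(): leading whitespace removed
rtrim = s.rstrip()  # like RTRIM(): trailing whitespace removed
trim = s.strip()    # like TRIM(): both ends removed

# repr() makes the remaining spaces visible.
print(repr(ltrim))
print(repr(rtrim))
print(repr(trim))
```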

3.7 SUBSTRING() Function

  • Returns a substring from a given string.
  • Syntax:

    SUBSTRING(string, startIndex, stopIndex)
  • string: the string from which a substring will be extracted.
  • startIndex: the index (type int) of the first character of the substring.
  • stopIndex: the index (type int) of the character following the last character of the substring.
  • Example:

    SUBSTRING('Cornell', 0, 4) --> 'Corn'
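
The index convention (start inclusive, stop exclusive) is the same one Python slicing uses, so the example can be checked with a short sketch. The function substring is a hypothetical stand-in; it does not reproduce how Pig handles out-of-range indices.

```python
def substring(string, start_index, stop_index):
    """Characters from start_index up to, but not including, stop_index."""
    return string[start_index:stop_index]


print(substring("Cornell", 0, 4))  # characters at indices 0, 1, 2, 3
print(substring("Cornell", 4, 7))  # characters at indices 4, 5, 6
```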

3.8 TOTUPLE() Function

  • Converts one or more expressions to type tuple.
  • Syntax:

    TOTUPLE(expression [, expression ...])
  • Example:

    a = LOAD 'students' AS (name:chararray, age:int, gpa:float);
    DUMP a;
    (John,18,4.0)
    (Mary,19,3.8)
    (Bill,20,3.9)
    (Joe,18,3.8)
    b = FOREACH a GENERATE TOTUPLE(name, age, gpa);
    DUMP b;
    ((John,18,4.0))
    ((Mary,19,3.8))
    ((Bill,20,3.9))
    ((Joe,18,3.8))

3.9 TOBAG() Function

  • Converts one or more expressions to type bag.
  • Syntax:

    TOBAG(expression [, expression ...])
  • Example:

    a = LOAD 'students' AS (name:chararray, age:int, gpa:float);
    DUMP a;
    (John,18,4.0)
    (Mary,19,3.8)
    (Bill,20,3.9)
    (Joe,18,3.8)
    b = FOREACH a GENERATE TOBAG(name, gpa);
    DUMP b;
    ({(John),(4.0)})
    ({(Mary),(3.8)})
    ({(Bill),(3.9)})
    ({(Joe),(3.8)})

3.10 TOMAP() Function

  • Converts key/value expression pairs into a map.
  • Syntax:

    TOMAP(key-expression, value-expression[, key-expression, value-expression ...])
  • Example:

    a = LOAD 'students' AS (name:chararray, age:int, gpa:float);
    DUMP a;
    (John,18,4.0)
    (Mary,19,3.8)
    (Bill,20,3.9)
    (Joe,18,3.8)
    b = FOREACH a GENERATE TOMAP(name, gpa);
    DUMP b;
    ([John#4.0])
    ([Mary#3.8])
    ([Bill#3.9])
    ([Joe#3.8])

4 Pig UDFs

5 Python UDFs

5.1 Registering Python UDFs

  • When you use your own UDF, you have to tell Pig where to look for that UDF. This is done via the register command.
  • The register command tells Pig where to find the resources for the Python UDFs used in your Pig Latin scripts. Here, you register the Python script that contains your UDF; the script must be in your current directory. For simplicity, in this class you may put all of these files, including your data file, at the system root.
  • Your Python script is a plain text file: you can compose it on CentOS with the vi editor, or write it on macOS/Windows.
  • Your data file must be loaded into the HDFS file system (using hadoop fs -copyFromLocal ..., or Ambari’s Files View).

5.2 Count the Number of Characters of Each Line of the poem.txt File with a Python UDF in Pig

There is Another Sky

Emily Dickinson

There is another sky,
Ever serene and fair,
And there is another sunshine,
Though it be darkness there;
Never mind faded forests, Austin,
Never mind silent fields -
Here is a little forest,
Whose leaf is ever green;
Here is a brighter garden,
Where not a frost has been;
In its unfading flowers
I hear the bright bee hum:
Prithee, my brother,
Into my garden come!

5.3 The Python UDF

  • Write the Python UDF as a Python script pyudf0.py in the vi editor.
  • Expose this function as a UDF by adding the @outputSchema decorator, which specifies a schema for the return value.

    # Declare the name and type of the value returned to Pig.
    @outputSchema("length:int")
    def get_length(data):
        # Number of characters in the input line.
        length = len(data)
        return length
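
Before running the UDF in Pig, it can help to try the function in plain Python. The script itself does not import outputSchema, so the sketch below assumes Pig supplies that decorator at registration time and defines a hypothetical stand-in for local testing only. A few lines of the poem serve as input.

```python
def outputSchema(schema):
    """Local stand-in for the decorator; it leaves the function unchanged."""
    def decorator(func):
        func.outputSchema = schema  # keep the schema string for inspection
        return func
    return decorator


@outputSchema("length:int")
def get_length(data):
    length = len(data)
    return length


poem_lines = [
    "There is another sky,",
    "Ever serene and fair,",
    "Prithee, my brother,",
]

lengths = [get_length(line) for line in poem_lines]
for line, n in zip(poem_lines, lengths):
    print(n, line)
```

The stand-in decorator should not be included in the file that is registered with Pig.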

5.4 The Pig Script to Use the Python UDF

5.5 An Improved UDF pyudf1.py

  • This version counts both the number of characters and the number of words for each line of the text.
  • Add "# of characters = " and "# of words = " before the respective values.

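    @outputSchema("nums:chararray")
    def get_length(data):
        # Split on whitespace to get the words of the line.
        words = data.split()
        num_chars = len(data)
        num_words = len(words)
        # Build one labeled string, matching the chararray schema.
        nums = "# of characters = " + str(num_chars) + ", # of words = " + str(num_words)
        return nums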
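
A locally runnable sketch of this UDF is given below for trying it outside Pig. It joins the two labeled values into a single string, which is one way to match the chararray schema; the separator is a choice made here. The outputSchema function is a hypothetical stand-in for the decorator that Pig is assumed to provide.

```python
def outputSchema(schema):
    """Local stand-in for the decorator; it leaves the function unchanged."""
    def decorator(func):
        return func
    return decorator


@outputSchema("nums:chararray")
def get_length(data):
    words = data.split()  # split on runs of whitespace
    num_chars = len(data)
    num_words = len(words)
    # One string carrying both labeled counts.
    nums = ("# of characters = " + str(num_chars)
            + ", # of words = " + str(num_words))
    return nums


print(get_length("There is another sky,"))
print(get_length("Never mind silent fields -"))
```

Because the line is split on whitespace, a standalone dash such as the one ending "Never mind silent fields -" is counted as a word.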

5.6 The New Pig Script to Use the Python UDF

5.7 Apply the Same UDF to a Different Data File